Contributor

@Yanlilyu Yanlilyu commented Aug 9, 2023

Description:

  • This branch adds a file-based routine, file_stats.py, to the PyDarshan CLI tools to help users find the most I/O-intensive files.
  • It also adds test_file_stats.py to the PyDarshan tests to cover file_stats.py.
  • The tool combines the data from multiple log files into a single DataFrame, groups the data by “id”, sorts it in descending order by the column the user names, and keeps the first n records (n is the number_of_rows input). It returns a DataFrame with the n most I/O-intensive files.
  • User inputs are log_path, module, order_by_colname, and number_of_rows. The command-line arguments are named arguments.
  • log_path should be a list of files or a shell glob.
  • The default values for module, order_by_colname, and number_of_rows are “POSIX”, “POSIX_BYTES_READ”, and 10, respectively. The tool uses these defaults for any value the user omits.
  • The tool checks whether the module is in the list of modules. If it is not, the tool prints an error and exits immediately.
  • order_by_colname should be “{mod}_BYTES_READ” or “{mod}_BYTES_WRITTEN”.
  • The tool also checks that order_by_colname is consistent with the module. For example, if the module is POSIX and order_by_colname is STDIO_BYTES_READ, the tool reports the error “Column name should be ‘{mod}_BYTES_READ’ or ‘{mod}_BYTES_WRITTEN’”.
  • Example usage:
    $ python -m darshan file_stats darshan_logs/nonmpi_workflow/worker*.darshan -m STDIO -o STDIO_BYTES_READ -n 5
    $ python -m darshan file_stats darshan_logs/nonmpi_workflow/worker*.darshan
    $ python -m darshan file_stats darshan_logs/nonmpi_workflow/worker_1.darshan darshan_logs/nonmpi_workflow/worker_3.darshan -m STDIO -o STDIO_BYTES_READ -n 5
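
For reviewers, the combine / group / sort / filter steps described above can be sketched with plain pandas. This is a minimal illustration, not the code in file_stats.py: the function names `check_colname` and `top_files` are hypothetical, the per-log DataFrames are hand-built stand-ins for what would be read from Darshan logs, and summing the counters within each “id” group is an assumption (the description only says the data is grouped by “id”).

```python
import pandas as pd


def check_colname(mod: str, order_by_colname: str) -> None:
    """Reject a sort column that does not belong to the chosen module."""
    allowed = (f"{mod}_BYTES_READ", f"{mod}_BYTES_WRITTEN")
    if order_by_colname not in allowed:
        raise ValueError(
            f"Column name should be '{mod}_BYTES_READ' or '{mod}_BYTES_WRITTEN'"
        )


def top_files(frames, mod="POSIX", order_by_colname="POSIX_BYTES_READ",
              number_of_rows=10):
    """Return the number_of_rows most I/O-intensive files across all frames.

    frames: one DataFrame per log, each with an "id" column and the
    "{mod}_BYTES_READ" / "{mod}_BYTES_WRITTEN" counters (assumed layout).
    """
    check_colname(mod, order_by_colname)
    counters = [f"{mod}_BYTES_READ", f"{mod}_BYTES_WRITTEN"]
    # Combine the per-log records into one DataFrame.
    combined = pd.concat(frames, ignore_index=True)
    # Group by file id; summing per group is an assumption of this sketch.
    grouped = combined.groupby("id", as_index=False)[counters].sum()
    # Sort in descending order by the requested column and keep the first n.
    ordered = grouped.sort_values(order_by_colname, ascending=False)
    return ordered.head(number_of_rows).reset_index(drop=True)


# Hand-built stand-ins for two logs that touch an overlapping set of files.
logs = [
    pd.DataFrame({"id": [1, 2],
                  "POSIX_BYTES_READ": [100, 50],
                  "POSIX_BYTES_WRITTEN": [0, 10]}),
    pd.DataFrame({"id": [2, 3],
                  "POSIX_BYTES_READ": [70, 5],
                  "POSIX_BYTES_WRITTEN": [1, 2]}),
]
top = top_files(logs, number_of_rows=2)
print(top)
```

In this sketch, file 2 appears in both stand-in logs, so its counters are added together before sorting. A mismatched pair such as module POSIX with STDIO_BYTES_READ raises a ValueError here, whereas the tool itself is described as printing an error.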

@Yanlilyu Yanlilyu changed the title Pydarshan file based sorting WIP: Pydarshan file based sorting Aug 10, 2023
@Yanlilyu Yanlilyu changed the title WIP: Pydarshan file based sorting WIP: Pydarshan file_based sorting Aug 10, 2023
@shanedsnyder shanedsnyder force-pushed the pydarshan-file-based-sorting branch from a08b533 to 15d39d9 Compare April 27, 2025 18:08
@shanedsnyder shanedsnyder force-pushed the pydarshan-file-based-sorting branch from 15d39d9 to 05157b0 Compare April 29, 2025 19:09

2 participants